Abstract



Motivated by information retrieval applications, we consider the one-dimensional colored range
reporting problem in rank space. The goal is to build a static data structure for sets
$C_1, \ldots, C_m \subseteq \{1, \ldots, \sigma\}$ that supports queries of the kind: Given indices $a, b$,
report the set $\bigcup_{a \le i \le b} C_i$.

We study the problem in the I/O model, and show that there exists an optimal linear-space data
structure that answers queries in $O(1 + k/B)$ I/Os, where $k$ denotes the output size and $B$ the disk block
size in words. In fact, we obtain the same bound for the harder problem of three-sided orthogonal range
reporting. In this problem, we are to preprocess a set of $n$ two-dimensional points in rank space, such
that all points inside a query rectangle of the form $[x_1, x_2] \times (-\infty, y]$ can be reported. The best previous
bounds for this problem are either $O(n \lg^2_B n)$ space and $O(1 + k/B)$ query I/Os, or $O(n)$ space and
$O(\lg^{(h)}_B n + k/B)$ query I/Os, where $\lg^{(h)}_B n$ is the base $B$ logarithm iterated $h$ times, for any constant
integer $h$. The previous bounds are both achieved under the indivisibility assumption, while our solution
exploits the full capabilities of the underlying machine. Breaking the indivisibility assumption thus
provides us with cleaner and optimal bounds.

Our results also imply an optimal solution to the following colored prefix reporting problem. Given
a set $S$ of strings, each $O(1)$ disk blocks in length, and a function $c : S \to 2^{\{1, \ldots, \sigma\}}$, support queries of
the kind: Given a string $p$, report the set $\bigcup_{x \in S \cap p^*} c(x)$, where $p^*$ denotes the set of strings with prefix $p$.
Finally, we consider the possibility of top-$k$ extensions of this result, and present a simple solution in a
model that allows non-blocked I/O.

1 Introduction



A basic problem in information retrieval is to support prefix predicates, such as datab*, that match all 
documents containing a string with a given prefix. Queries involving such a predicate are often resolved by 
computing a list of all documents satisfying it, and merging this list with similar lists for other predicates 
(e.g. inverted indexes). Recent overviews can be found in e.g. [29, 12]. To the best of our knowledge, existing
solutions either require super-linear space (e.g. storing all answers) or report a multi-set, meaning that the 
same document may be reported many times if it has many words with the given prefix. In range reporting 
terminology we are interested in the colored reporting problem, where each color (document) may match 
several times, but we are only interested in reporting each color once. 

A related problem is that of query relaxation. When no answers are produced on a query for some given
terms, an information retrieval system may try to "relax" some of the search conditions to produce near- 
matches. For example, a search for "colour television" may be relaxed such that also documents containing 
"color" and/or "TV" are regarded as matches, or further relaxed to also match "3D" and "screen". In general, 
terms may be arranged as leaves in a tree (possibly with duplicates), such that inner nodes correspond to 
a natural relaxation of the terms below it. When answering a query we are interested in the documents 
containing a term in some subtree. Again, this reduces to a colored 1D range reporting problem.

1.1 Model of Computation 

In this paper we study the above problems in the I/O-model [6] of computation. In this model, the input 
to a data structure problem is assumed too large to fit in main memory of the machine, and thus the data 
structure must reside on disk. The disk is assumed infinite, and is divided into disk blocks, each consisting 
of $B$ words of $\Theta(\lg n)$ bits each. We use $b = \Theta(B \lg n)$ to denote the number of bits in a disk block.

To answer a query, the query algorithm repeatedly reads disk blocks into the main memory, which has
size $M$ words; each such read is an I/O. Based on the contents of the read blocks it finally reports the
answer to the query. The cost of answering a query is measured in the number of I/Os performed by the
query algorithm.

The indivisibility assumption. A common assumption made when designing data structures in the I/O- 
model is that of indivisibility. In this setting, a data structure must treat input elements as atomic entities, and 
has no real understanding of the bits constituting the words of a disk block. There is one main motivating 
factor for using this restriction: It makes proving lower bounds significantly easier. This alone is not a 
strong argument why upper bounds should be achieved under the indivisibility assumption, but it has long 
since been conjectured that for most natural problems, one cannot do (much) better without the indivisibility 
assumption. In this paper, we design data structures without this assumption, which allows us to achieve 
optimal bounds for some of the most fundamental range searching problems, including the problems above. 

1.2 Our contributions 

In this paper we present the first data structures for colored prefix reporting queries that simultaneously have
linear space usage and optimal query complexity in the I/O model [6]. More precisely, our data structure
stores sets $C_1, \ldots, C_m \subseteq \{1, \ldots, \sigma\}$ and supports queries of the kind: Given indices $a, b$, report the set
$\bigcup_{a \le i \le b} C_i$. If the reported set has size $k$, then the number of I/Os used to answer the query is $O(1 + k/B)$.
If we let $n = \sum_{i=1}^{m} |C_i|$ denote the total data size, then the space usage of the data structure is $O(n)$ words,
i.e. linear in the size of the data.

In fact, in Section 2 we present an optimal solution to the harder and very well-studied range searching
problem of three-sided orthogonal range reporting in two-dimensional rank space, and then use a known
reduction [19] to get the above result (Section 3). Given a set $S$ of $n$ points from the grid $[n] \times [n] =
\{1, \ldots, n\} \times \{1, \ldots, n\}$, this problem asks to construct a data structure that is able to report all points inside
a query rectangle of the form $[x_1, x_2] \times (-\infty, y]$. We consider the static version of the problem, where
updates to the set $S$ are not required. We note that optimal solutions to this problem have been known for
more than a decade on the word-RAM, but these solutions are all inherently based on random accesses. The
query time of these data structures thus remains $O(1 + k)$ when analysed in the I/O model, which falls short
of the desired $O(1 + k/B)$ query cost.
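To pin down the semantics of the two query types, the following Python sketch gives brute-force reference implementations of colored range reporting and of three-sided reporting. It scans the whole input and therefore says nothing about the $O(1 + k/B)$ bound; the function names and the toy inputs are hypothetical and chosen for illustration only.

```python
def colored_range_report(sets, a, b):
    """Return the distinct colors in C_a, ..., C_b (1-based, inclusive)."""
    reported = set()
    for i in range(a, b + 1):
        reported |= sets[i - 1]
    return reported


def three_sided_report(points, x1, x2, y):
    """Return all points (px, py) with x1 <= px <= x2 and py <= y."""
    return [(px, py) for (px, py) in points if x1 <= px <= x2 and py <= y]


# Hypothetical toy input: m = 4 sets over the colors {1, ..., 5}.
C = [{1, 3}, {3, 4}, {2}, {1, 5}]
n = sum(len(c) for c in C)               # total data size
answer = colored_range_report(C, 2, 3)   # colors occurring in C_2 or C_3
k = len(answer)                          # output size

# Hypothetical point set in rank space (distinct x- and y-coordinates).
S = [(1, 4), (2, 1), (3, 5), (4, 2), (5, 3)]
inside = three_sided_report(S, 2, 4, 2)  # query [2, 4] x (-inf, 2]
```

Note that color 3 occurs in both $C_2$ and $C_3$ but is reported only once, which is the difference between colored reporting and reporting a multi-set.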

One of the key ideas in obtaining our results for three-sided range reporting is an elegant combination 
of Fusion Trees and tabulation, allowing us to make a dynamic data structure partially persistent, free of 
charge, when the number of updates to the data structure is bounded by $b^{O(1)}$. We believe several of the
ideas we develop, or variations thereof, may prove useful for many other natural problems. 

Finally, in Section 4 we consider the top-$k$ variant, where we only need to report the first $k$ colors in the
set, and $k$ is a parameter of the data structure. In a scatter I/O model that allows $B$ parallel word accesses
we get a data structure with optimal query complexity also for this problem. This data structure uses space
$O(n + mB)$, which is $O(n)$ if the average size of $C_1, \ldots, C_m$ is at least $B$. Since the scatter I/O model
abstracts the parallel data access capabilities of modern hardware, we believe that it is worth investigating 
further, and our results here may be a first step in this direction. 

Notation. We always use n to denote the size of data in number of elements. For colored prefix reporting 
this means that we have $m$ subsets of $\{1, \ldots, \sigma\}$ of total size $n$. As mentioned, we use $b$ to denote the
number of bits in a disk block. Since a block may store $B$ pointers and input elements, we make the natural
assumption that $b = \Theta(B \lg n)$, i.e., each disk block consists of $B$ words, where a word is $\Theta(\lg n)$ bits.

1.3 Related work 

The importance of three-sided range reporting is mirrored in the number of publications on the problem, see 
e.g. 1H1I171 for the pointer machine, dS |9] [TT] [3] |4) for the cache-oblivious and ll8ll26llT4l for the word-RAM 
model. One of the main reasons why the problem has seen so much attention stems from the fact that range
searching with more than three sides no longer admits linear space data structures with polylogarithmic 
query cost and a linear term in the output size. Thus three-sided range searching is the "hardest" range 
searching problem that can be solved efficiently both in terms of space and query cost, and at the same time, 
this has proved to be a highly non-trivial task. 

The best I/O model solution to the three-sided range reporting problem in 2-d, where coordinates can
only be compared, is due to Arge et al. [10]. Their data structure uses linear space and answers queries
in $O(\lg_B n + k/B)$ I/Os. This is optimal when coordinates can only be compared. Nekrich reinvestigated
the problem in a setting where the points lie on an integer grid of size $U \times U$. He demonstrated that for
such inputs, it is possible to achieve $O(\lg_2 \lg_B U + k/B)$ query cost while maintaining linear space [23].
This bound is optimal by a reduction from predecessor search. His solution relies on indirect addressing
and arithmetic operations such as multiplication and division. In the same paper, he also gave several data
structures for the case of input points in rank space. The first uses linear space and answers queries in
$O(\lg^{(h)}_B n + k/B)$ I/Os, where $\lg^{(h)}_B n$ is the base $B$ logarithm iterated $h$ times, for any constant integer $h$.
The second uses $O(n \lg^*_B n)$ space and answers queries in $O(\lg^*_B n + k/B)$ I/Os, and the final data structure
uses $O(n \lg^2_B n)$ space and answers queries in optimal $O(1 + k/B)$ I/Os. All these data structures use only
comparisons and indirect addressing.

Higher-dimensional orthogonal range reporting has also received much attention in the I/O model, see 
e.g. ll27l fTll2ll24l. The best current data structures for orthogonal range reporting in $d$-dimensional space
($d \ge 3$), where coordinates can only be compared, either answer queries in
$O(\lg_B n (\lg n / \lg \lg_B n)^{d-2} + k/B)$ I/Os and use $O(n (\lg n / \lg \lg_B n)^{d-1})$ space, or answer queries in
$O(\lg_B n \lg^{d-3} n + k/B)$ I/Os and use $O(n (\lg n / \lg \lg_B n)^3 \lg^{d-3} n)$ space 0.

All these results in some sense do not exploit the full power of the underlying machine. While this 
provides for easier lower bounds, it should be clear when comparing to our results, that this approach might 
come at a cost of efficiency. Finally, we note that recent work by Iacono and Pătraşcu [20] also focuses on
obtaining stronger upper bounds (for dynamic dictionaries) in the I/O model by abandoning the indivisibility
assumption. 

2 Three-Sided Orthogonal Range Reporting 

In this section we describe our data structure for three-sided orthogonal range reporting. Recall that in this 
problem we are interested in reporting the points in $[x_1, x_2] \times (-\infty, y]$. If there are $k$ such points, our data
structure answers queries in $O(1 + k/B)$ I/Os and uses linear space for input points in rank space, i.e., when
the input points lie on the grid $[n] \times [n]$. We set out with a brief preliminaries section, and then describe our
data structure in Section 2.2.

2.1 Preliminaries 

In this section, we briefly discuss two fundamental data structures that we make use of in our solutions, the 
Fusion Tree of Fredman and Willard [16] and the External Memory Priority Search Tree (EM-PST) of Arge
et al. [10]. For the EM-PST, we present a parametrized version of the original data structure, while for the
Fusion Tree, we merely state the result and use it as a black box. 

Fusion Tree. Allowing for full access to the bits in a disk block yields more efficient data structure
solutions to several fundamental problems. In this paper, we use the Fusion Tree of Fredman and Willard
(see footnote 1):

Lemma 1 There exists a linear space data structure that supports predecessor search in $O(\lg_b n)$ I/Os on
$n$ elements in a universe of size $u$, where $\lg u < b$.

The requirement lg u < b ensures that we can store an element in a single disk block. In the original 
Fusion Tree data structure, much care is taken to implement various operations on words in constant time. 
However this is trivialized in the I/O model, since we can do arbitrary computations on a disk block in main 
memory free of charge. 
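For reference, the predecessor queries used throughout can be specified by the following minimal Python sketch. It uses binary search over a sorted in-memory list as a stand-in and is not a Fusion Tree; we assume here that the predecessor of a query value is the largest stored key that is at most that value.

```python
from bisect import bisect_right


def predecessor(sorted_keys, q):
    """Return the largest key <= q, or None if every key exceeds q."""
    i = bisect_right(sorted_keys, q)
    return sorted_keys[i - 1] if i > 0 else None


keys = [3, 8, 15, 42]   # hypothetical key set, kept sorted
```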

External Memory Priority Search Tree. The EM-PST is an external memory data structure that answers
three-sided range queries. We describe the basic layout of the EM-PST, parametrized with a leaf-parameter $\ell$
and a branching-parameter $f$.

An EM-PST is constructed from an input set of $n$ points in the following recursive manner: Select
the $B$ points with smallest $y$-coordinate among the remaining points and place them in a root node. Then
partition the remaining points into $f$ equal-sized consecutive groups wrt. the $x$-coordinates of the points.
Recurse on each group, and let the recursively constructed trees be the children of the root node, ordered by
$x$-coordinates ($f - 1$ splitter keys are stored in a Fusion Tree at the root). We end the recursion when the
number of points in a subproblem drops to between $\ell$ and $f\ell + B$, and instead create a leaf node. Note that
the EM-PST is both a heap on the $y$-coordinates and a search tree on the $x$-coordinates, hence its name.
Furthermore, the height of the tree is $O(\lg_f(n/\ell))$, and there are $O(n/\ell)$ nodes in the tree.

1 Strictly speaking, we use what Fredman and Willard refer to as q-heaps, which generalize the more well-known fusion trees.

The original data structure of Arge et al. has leaf-parameter $\ell = B$ and branching parameter $f = B$.
They augment each node of the tree with a number of auxiliary data structures to support fast queries. We 
omit the description of these data structures, as we will only use the basic layout of the tree in our solutions. 
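The recursive layout described in this section can be sketched in Python as follows. This is only an in-memory illustration under our own assumptions: points are (x, y) pairs with distinct coordinates, a plain sorted list of splitter keys stands in for the Fusion Tree, and the names `Node`, `build_em_pst` and `collect` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    points: list                                    # points associated with this node
    splitters: list = field(default_factory=list)   # f - 1 x-keys (a Fusion Tree in the text)
    children: list = field(default_factory=list)


def build_em_pst(points, B, f, ell):
    """Build the basic EM-PST layout on (x, y) pairs with distinct coordinates."""
    if len(points) <= f * ell + B:
        return Node(points=sorted(points))          # small subproblem: create a leaf
    by_y = sorted(points, key=lambda p: p[1])
    top, rest = by_y[:B], sorted(by_y[B:])          # the B smallest y-coordinates stay here
    cuts = [i * len(rest) // f for i in range(f + 1)]
    groups = [rest[cuts[i]:cuts[i + 1]] for i in range(f)]   # consecutive in x
    node = Node(points=top)
    node.splitters = [g[0][0] for g in groups[1:]]
    node.children = [build_em_pst(g, B, f, ell) for g in groups]
    return node


def collect(node):
    """All points stored in the subtree rooted at `node`."""
    return node.points + [p for c in node.children for p in collect(c)]
```

Because a node is only split when it holds more than $f\ell + B$ points, each of the $f$ groups receives at least $\ell$ points, so every leaf ends up with between $\ell$ and $f\ell + B$ points.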

2.2 Data Structure 

In this section, we describe our optimal data structure for three-sided range reporting in rank space. At a 
high level, our data structure places the points in an EM-PST, and augments each leaf with auxiliary data 
structures. These allow efficient reporting of all points in a query range $[x_1, x_2] \times (-\infty, y]$ that are associated
with nodes on the path from the root to the corresponding leaf. Since the number of points associated with
nodes on such a path is rather small, we are able to answer these queries in $O(1 + k/B)$ I/Os. To report the
remaining points in a query range, we exploit that for each node not on the path to either of the two leaves
containing the predecessors of $x_1$ and $x_2$, either all the points associated with the node have an $x$-coordinate
in the query range, or none of them have. We now give the full details of our solution.

The Data Structure. Suppose there exists a base data structure for three-sided range reporting that uses
linear space and answers queries in $O(1 + k/B)$ I/Os when constructed on $b^{O(1)}$ points with coordinates
on the grid $[n] \times [n]$. Let $S$ be the input set of $n$ points, and construct on $S$ the EM-PST layout described
in Section 2.1 using branching parameter $f = 2$ and leaf-parameter $\ell = B \lg^2 n$. Denote the resulting tree
by $T$. For each leaf $v$ in $T$ we store two auxiliary data structures:

1. A base data structure on the $O(f\ell + B) = O(B \lg^2 n) = O(b^2)$ points associated with $v$.

2. For each ancestor $w$ of $v$, we store a base data structure on the $O(fB \lg n) = O(b)$ points associated
with all nodes that are either on the path from $w$ to the parent of $v$, or are siblings of a node on the
path. We furthermore augment each of the points with the node it comes from.

Finally, we augment T with one additional auxiliary data structure. This data structure is a simple array 
with n entries, one for each x-coordinate. The ith entry of the array stores a pointer to the leaf containing 
the x-predecessor of i amongst points associated with leaves of T. If i has no such predecessor, then we 
store a pointer to the first leaf of T. 

Query Algorithm. To answer a query $[x_1, x_2] \times (-\infty, y]$, we first use the array to locate the leaf $u_1$
containing the predecessor of $x_1$, and the leaf $u_2$ containing the predecessor of $x_2$. We then locate the
lowest common ancestor $\mathrm{LCA}(u_1, u_2)$ of $u_1$ and $u_2$. Since $T$ is a balanced binary tree, $\mathrm{LCA}(u_1, u_2)$ can be
found with $O(1)$ I/Os if each node contains $O(\lg n)$ bits describing the path from the root.
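One way to realize the lowest common ancestor computation from the stored path bits is sketched below in Python. The encoding is our own assumption: each root-to-node path is packed into an integer, most significant bit first, with 0 for a left child and 1 for a right child, and both nodes are assumed to lie at the same depth. The depth of the lowest common ancestor is then the length of the longest common prefix of the two paths, which a single XOR reveals.

```python
def lca_depth(path_u, path_v, depth):
    """Depth of the LCA of two nodes at the same `depth`.

    `path_u` and `path_v` are root-to-node paths packed as integers
    (most significant bit = first step from the root, 0 = left, 1 = right).
    """
    return depth - (path_u ^ path_v).bit_length()


def ancestor_path(path, depth, ancestor_depth):
    """Packed path of the ancestor at `ancestor_depth` of a node at `depth`."""
    return path >> (depth - ancestor_depth)
```

Since a path has only $O(\lg n)$ bits, it fits in a constant number of words, so no additional I/Os are needed once the two leaves have been read.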

We now query the base data structures stored on the points in $u_1$, $u_2$ and their sibling leaves. This reports
all points in the query range associated with those nodes. We then query the second base data structure
stored in $u_1$, setting $w$ to the root of $T$ (see 2. above), and the second base data structure stored in $u_2$, setting
$w$ to that grandchild of $\mathrm{LCA}(u_1, u_2)$ which is on the path to $u_2$. Observe that this reports all points in the
query range that are either associated to nodes on the paths from the root to $u_1$ or $u_2$, or to nodes that have
a sibling on the paths.

We now exploit the heap and search tree structure of $T$: For a node $v$ not on the paths to $u_1$ and $u_2$, but
with a sibling on one of them, we know that either the entire subtree rooted at $v$ stores only points with an
$x$-coordinate inside the query range, or none of the points in the subtree are inside the query range.
Furthermore, the heap ordering ensures that the points with smallest $y$-coordinate in the subtree are associated to
$v$. Thus if not all $B$ points associated to $v$ were reported above, then no further points in the subtree can be
inside the query range. We thus proceed by scanning all the reported points above, and for each node $v$ not
on the paths to $u_1$ and $u_2$, we verify whether all $B$ associated points were reported. If this is the case, we
visit the subtree rooted at $v$ in the following recursive manner:

If the children of v are not leaves, we scan the B points associated to both of them, and report those with 
a y-coordinate inside the query range. If all B points associated to a child are reported, we recurse on that 
child. If the children are leaves, we instead query the base data structure stored on the associated points and 
terminate thereafter. 

As a side remark, observe that if we mark the point with largest y-coordinate in each node, then all B 
points associated to a node with a sibling on one of the two paths are reported, iff the marked point associated 
to the node is reported. Thus the above verification step can be performed efficiently. 
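The recursive visit of a subtree hanging off the query paths can be sketched in Python as follows. The sketch makes simplifying assumptions of our own: a node is a dictionary with keys "points" and "children", and the points of a leaf are scanned directly as a stand-in for querying the base data structure stored there.

```python
def report_subtree(node, y, B, out):
    """Append to `out` the points with y-coordinate <= y stored below `node`.

    Precondition (as in the text): all B points of `node` were already
    reported, and the whole subtree lies inside the x-range of the query.
    """
    for child in node["children"]:
        hits = [p for p in child["points"] if p[1] <= y]
        out.extend(hits)
        # Heap order: if some point of an internal child lies above y, then
        # so does every point below that child, and we can stop there.
        if child["children"] and len(hits) == B:
            report_subtree(child, y, B, out)
```

Each recursive call is triggered by $B$ newly reported points, which is what allows the cost of visiting the children to be charged to the output size.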

Analysis. The base data structures stored on the points in the leaves use $O(n)$ space in total, since each
input point is stored in at most one such data structure. There are $O(n/\ell) = O(n/(B \lg^2 n))$ leaves, each
storing $O(\lg_f n) = O(\lg n)$ data structures of the second type. Since each such data structure uses $O(B \lg n)$
space, we conclude that our data structure uses linear space in total.
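Written out as a single product, the space used by the data structures of the second type is bounded as follows, using $f = 2$ and $\ell = B \lg^2 n$:

```latex
\[
\underbrace{O\!\left(\frac{n}{B \lg^2 n}\right)}_{\text{leaves}}
\cdot
\underbrace{O(\lg n)}_{\text{structures per leaf}}
\cdot
\underbrace{O(B \lg n)}_{\text{space per structure}}
= O(n).
\]
```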

The query cost is $O(1)$ I/Os for finding $u_1$, $u_2$ and $\mathrm{LCA}(u_1, u_2)$. Reporting the points in $u_1$, $u_2$ and
their siblings costs $O(1 + k/B)$ I/Os. Querying the second base data structures in $u_1$ and $u_2$ also costs
$O(1 + k/B)$ I/Os. Finally observe that we only visit a node not on the paths to $u_1$ or $u_2$ if all $B$ points
associated to it are reported. Since we spend $O(f) = O(1)$ I/Os visiting the children of such a node, we
may charge this to the output size, and we conclude that our data structure answers queries in $O(1 + k/B)$
I/Os, assuming that our base data structure is available.

Lemma 2 If there exists a linear space data structure for three-sided range reporting that answers queries
in $O(1 + k/B)$ I/Os on $b^{O(1)}$ points on the grid $[n] \times [n]$, then there exists a linear space data structure for
three-sided range reporting on $n$ points in rank space, that answers queries in $O(1 + k/B)$ I/Os.

2.3 Base Data Structure 

In the following we describe our linear space data structure that answers three-sided range queries in $O(1 + k/B)$
I/Os on $b^{O(1)}$ points on the grid $[n] \times [n]$. We use two different approaches depending on the disk block size:

If $B = \Omega(\lg^{1/16} n)$, then we use the EM-PST of Arge et al. This data structure uses linear space and
answers queries in $O(\lg_B b^{O(1)} + k/B) = O(1 + k/B)$ I/Os since $b = \Theta(B \lg n) = O(B^{17})$.

The hard case is thus when $B = o(\lg^{1/16} n)$, which implies $b = o(\lg^{17/16} n)$. We first show how we
obtain the desired query bound when the number of input points is $O(b^{1/8})$, and then extend our solution to
any constant exponent.
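The two cases rest on the following short calculations, spelled out here step by step (assuming $B \ge 2$ so that $\lg_B$ is well defined):

```latex
\begin{align*}
B = \Omega(\lg^{1/16} n)
  &\;\Longrightarrow\; \lg n = O(B^{16}) \\
  &\;\Longrightarrow\; b = \Theta(B \lg n) = O(B^{17}) \\
  &\;\Longrightarrow\; \lg_B b^{O(1)} = O(1) \cdot \lg_B O(B^{17}) = O(1), \\[4pt]
B = o(\lg^{1/16} n)
  &\;\Longrightarrow\; b = \Theta(B \lg n) = o(\lg^{1/16} n \cdot \lg n) = o(\lg^{17/16} n).
\end{align*}
```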

2.3.1 Very Few Input Points. 

In the following we let $m = O(b^{1/8}) = o(\lg^{17/128} n)$ denote the number of input points. Note that if we
reduce the $m$ points to rank space, then we can afford to store a table with an entry for every possible input
and query pair. Unfortunately, we cannot simply store all answers to queries, as we would need to map the
reported points back from rank space to their original coordinates, potentially costing $O(1)$ I/Os per reported
point, rather than $O(1/B)$ I/Os. Our solution combines tabulation with the notion of partial persistence to
achieve the desired query cost of $O(1 + k/B)$ I/Os.

A dynamic data structure is said to be partially persistent if it supports querying any past version of the 
data structure. More formally, assume that the updates to a dynamic data structure are assigned increasing 
integer IDs from a universe $[u]$. Then the data structure is partially persistent if it supports, given a query $q$
and an integer $i \in [u]$, answering $q$ as if only updates with ID at most $i$ had been performed. We think of
these IDs as time, and say that an update with ID $i$ happened at time $i$.

We observe that a partially persistent insertion-only data structure for 1d range reporting solves the
three-sided range reporting problem: Sweep the input points to the three-sided problem from smallest
$y$-coordinate to largest. When a point $p = (x, y)$ is encountered, we insert $p$ into the 1d data structure using
$x$ as its coordinate, and $y$ as the time of the insertion. To answer a three-sided query $[x_1, x_2] \times (-\infty, y]$,
we query the 1d data structure with range $[x_1, x_2]$ at time $y$. This clearly reports the desired points. In the
following we therefore devise an insertion-only 1d range reporting data structure, and then show how to
make it partially persistent.
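The sweep can be illustrated with the following Python sketch. The class `PersistentRange1D` is a hypothetical brute-force stand-in that simply logs every insertion together with its time; it is meant to show the interface assumed by the reduction, not the efficient structure developed in the remainder of this section.

```python
class PersistentRange1D:
    """Brute-force stand-in for a partially persistent insertion-only 1d structure."""

    def __init__(self):
        self.log = []                      # entries (time, coordinate, payload)

    def insert(self, coord, time, payload):
        assert not self.log or time >= self.log[-1][0], "times must not decrease"
        self.log.append((time, coord, payload))

    def query(self, x1, x2, time):
        """Payloads with coordinate in [x1, x2] that were inserted at time <= `time`."""
        return [p for (t, c, p) in self.log if t <= time and x1 <= c <= x2]


def build_three_sided(points):
    """Sweep the points by increasing y and insert x with insertion time y."""
    ds = PersistentRange1D()
    for (x, y) in sorted(points, key=lambda p: p[1]):
        ds.insert(coord=x, time=y, payload=(x, y))
    return ds
```

A three-sided query $[x_1, x_2] \times (-\infty, y]$ then becomes the call `ds.query(x1, x2, y)`.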

1d Insertion-Only Range Reporting. Our 1d data structure consists of a simple linked list of disk blocks.
Each block stores a set of at most $B - 1$ inserted points, and we maintain the invariant that all coordinates
of points inside a block are smaller than all coordinates of points inside the successor block. Note however
that we do not require the points to be sorted within each block. Initially we have one empty block.

When a point p is inserted with coordinate x, we scan through the linked list to find the block containing 
the predecessor of x. We add p to that block. If the block now contains B points, we split it into two blocks, 
each containing B/2 points. We partition the points around the median coordinate, such that all points inside 
the first block have smaller coordinates than those in the second. This clearly maintains our invariant. 

To answer a query $[x_1, x_2]$ we assume for now that we are given a pointer to the block containing the
predecessor of $x_1$. From this block we scan forward in the linked list, until a block is encountered where
some point has coordinate larger than $x_2$. In the scanned blocks, we report all points that are inside the query
range. Because of the ordering of points among blocks, this answers the query in $O(1 + k/B)$ I/Os.
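A minimal in-memory Python sketch of this linked list of blocks is given below, assuming distinct coordinates and $B \ge 2$. The class names are hypothetical, and the scan at the start of `query` is only a stand-in for the pointer to the starting block that the text assumes is given.

```python
class Block:
    def __init__(self, points=None):
        self.points = points if points is not None else []   # unsorted, at most B - 1
        self.next = None


class BlockList:
    def __init__(self, B):
        self.B = B
        self.head = Block()                     # initially one empty block

    def insert(self, x):
        blk = self.head
        # Advance while the successor block still starts at or before x.
        while blk.next is not None and min(blk.next.points) <= x:
            blk = blk.next
        blk.points.append(x)
        if len(blk.points) == self.B:           # split around the median
            pts = sorted(blk.points)
            half = self.B // 2
            right = Block(pts[half:])
            right.next, blk.next = blk.next, right
            blk.points = pts[:half]

    def query(self, x1, x2):
        blk = self.head
        while blk.next is not None and min(blk.next.points) <= x1:
            blk = blk.next                      # stand-in for the given start pointer
        out = []
        while blk is not None:
            out.extend(p for p in blk.points if x1 <= p <= x2)
            if any(p > x2 for p in blk.points):
                break                           # all later blocks are larger still
            blk = blk.next
        return out

    def blocks(self):
        """Contents of the blocks in list order (for inspection only)."""
        out, blk = [], self.head
        while blk is not None:
            out.append(list(blk.points))
            blk = blk.next
        return out
```

After inserting 5, 1, 9, 3, 7, 2, 8 with $B = 4$, for example, the list consists of three blocks, each holding fewer than $B$ points, and every coordinate in a block is smaller than every coordinate in the next block.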

Partial Persistence. We now modify the insertion algorithm to make the data structure partially persistent: 

1. When a block splits, we do not throw away the old block, but instead maintain it on disk. We also
assign increasing integer IDs to each created block, starting from 1 and incrementing by one for each 
new block. 

2. We store an array with one entry for each block ID. Entry i stores a pointer to the block with ID i. 

3. Whenever a successor pointer changes, we do not throw away the old value. Instead, each block 
stores (a pointer to) a Fusion Tree that maintains all successor pointers the block has had. Thus when 
a successor pointer changes, we insert the new pointer into the Fusion Tree with key equal to the time 
of the change. 

4. Finally, we augment each point with its insertion time. 

Note that once a block splits, the old block will not receive further updates. The blocks that have not yet
split after a sequence of updates constitute the original data structure after the same updates, and these are
the only blocks we update during future insertions.

We answer a query $[x_1, x_2]$ at time $y$ in the following way: Assume for now that we can find the ID of the
block that contained the predecessor of $x_1$ at time $y$. Note that this block might have split at later updates,
but is still stored in our data structure. We use the array and the ID to retrieve this block, and from there
we simulate the original query algorithm. When the original query algorithm requests a successor pointer,
we find the predecessor of $y$ in that block's Fusion Tree, yielding the successor pointer of the block at time
$y$. We stop when a block containing a point inserted at time at most $y$ and having coordinate larger than $x_2$
is encountered. For all the blocks scanned through this procedure, we report the points with coordinate in
the query range that were inserted at time at most $y$. This answers the query correctly as the blocks scanned
contain all points that would have been scanned if the query was asked at time $y$, except possibly some more
points inserted after time $y$.

By Lemma 1 we get that each predecessor search costs $O(1)$ I/Os since we have at most $m = O(b^{1/8})$
successor pointer updates. Thus the total query cost is $O(1 + k/B)$ I/Os since the successor of a block at time
$y$ contains at least $B/2$ points that were inserted at time at most $y$, which allows us to charge the predecessor
search to the output size. To argue that the space of our data structure is linear, we charge splitting a block
to the at least $B/2$ insertions into the block since it was created. Splitting a block increases the number of
used disk blocks by $O(1)$, so we get linear space usage.
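The partially persistent variant can be sketched in Python as follows. Several details are assumptions of this sketch rather than part of the construction above: a sorted list with binary search stands in for each block's Fusion Tree, a sentinel block with ID 0 records which block is first in the list at each time, insertion times are distinct positive integers, and `start_block` is a linear-scan stand-in for locating the block that contained the predecessor of $x_1$ at time $y$.

```python
from bisect import bisect_right


class PBlock:
    def __init__(self, bid, points):
        self.id = bid
        self.points = points      # list of (coordinate, insertion_time)
        self.succ_times = []      # times at which the successor pointer changed
        self.succ_ids = []        # successor block ID stored at each of those times

    def set_succ(self, time, succ_id):
        self.succ_times.append(time)
        self.succ_ids.append(succ_id)

    def succ_at(self, time):
        # Predecessor search on `time`; a sorted list stands in for the Fusion Tree.
        i = bisect_right(self.succ_times, time)
        return self.succ_ids[i - 1] if i > 0 else None


class PersistentBlockList:
    def __init__(self, B):
        self.B = B
        self.blocks = [PBlock(0, []), PBlock(1, [])]   # array: block ID -> block
        self.blocks[0].set_succ(0, 1)                  # ID 0 is a sentinel before the list

    def _new_block(self, points):
        blk = PBlock(len(self.blocks), points)
        self.blocks.append(blk)
        return blk

    def _first_coord(self, blk, time):
        return min((c for c, t in blk.points if t <= time), default=float("inf"))

    def start_block(self, x, time):
        """ID of the block that contained the predecessor of x at `time`."""
        cur = self.blocks[self.blocks[0].succ_at(time)]
        while True:
            nxt = cur.succ_at(time)
            if nxt is None or self._first_coord(self.blocks[nxt], time) > x:
                return cur.id
            cur = self.blocks[nxt]

    def insert(self, x, time):
        """Insert coordinate x; `time` must exceed all earlier insertion times."""
        prev = self.blocks[0]
        cur = self.blocks[prev.succ_ids[-1]]
        while (cur.succ_ids and cur.succ_ids[-1] is not None
               and self._first_coord(self.blocks[cur.succ_ids[-1]], time) <= x):
            prev, cur = cur, self.blocks[cur.succ_ids[-1]]
        cur.points.append((x, time))
        if len(cur.points) == self.B:          # split, but keep the old block `cur`
            pts = sorted(cur.points)
            cur.points.pop()                   # the old block keeps its B - 1 points
            old_succ = cur.succ_ids[-1] if cur.succ_ids else None
            left = self._new_block(pts[:self.B // 2])
            right = self._new_block(pts[self.B // 2:])
            left.set_succ(time, right.id)
            right.set_succ(time, old_succ)
            prev.set_succ(time, left.id)       # old pointer values stay in the history

    def query(self, x1, x2, time, start_id):
        out, bid = [], start_id
        while bid is not None:
            blk = self.blocks[bid]
            live = [c for c, t in blk.points if t <= time]
            out.extend(c for c in live if x1 <= c <= x2)
            if any(c > x2 for c in live):
                break
            bid = blk.succ_at(time)
        return out
```

A query $[x_1, x_2]$ at time $y$ is then answered by `ds.query(x1, x2, y, ds.start_block(x1, y))`. Blocks that have already split remain reachable through their IDs and through the pointer histories of the blocks that preceded them at the time.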

Three-Sided Range Reporting. The only thing preventing us from using the partially persistent data
structure above to solve the three-sided range reporting problem, is a way to find the ID of the block
containing the predecessor of $x_1$ at time $y$. For this, we use tabulation.

The key observation we use for retrieving this ID is that if we reduce the coordinates of the input points
for the three-sided range reporting problem to rank space before inserting them into the partially persistent
1d data structure, then only the coordinates of points and the update time of each successor pointer changes,
i.e., the ID of the block containing a particular point remains the same. Thus we create a table with an entry
for every input and query pair to the three-sided range reporting problem in rank space. Each entry of the
table contains the ID of the block in the partially persistent 1d structure from which the query algorithm
must start scanning.

To summarize, we solve the three-sided range reporting problem by building the partially persistent 1d
data structure described above. We then store two Fusion Trees, one on the $x$-coordinates and one on the
$y$-coordinates of the input points. These allow us to reduce a query to rank space. Furthermore, we store a
table mapping input and query pairs in rank space to block IDs. Finally, we store an integer that uniquely
describes the input point set in rank space (such an integer easily fits in $\lg n$ bits). To answer a query
$[x_1, x_2] \times (-\infty, y]$, we use the Fusion Trees to map the query to rank space wrt. the input point set. We then
use the unique integer describing the input set, and the query coordinates in rank space, to perform a lookup
in our table, yielding the ID of the block containing the predecessor of $x_1$ at time $y$. Finally, we execute the
query algorithm for the original query (not in rank space) on the partially persistent 1d data structure, using
the ID to locate the starting block.

Since there are $m = O(b^{1/8})$ input points, we get reduction to rank space in $O(1)$ I/Os by Lemma 1.
Querying the 1d data structure costs $O(1 + k/B)$ I/Os as argued above, thus we conclude:

Lemma 3 There exists a linear space data structure for three-sided range queries on $O(b^{1/8})$ points on the
grid $[n] \times [n]$, that answers queries in $O(1 + k/B)$ I/Os.

2.3.2 Few Points



In this section we extend the result of the previous section to give a linear space data structure that answers
three-sided range queries in $O(1 + k/B)$ I/Os on $b^{O(1)}$ points when $B = o(\lg^{1/16} n)$.

Let $m = b^{O(1)}$ denote the number of input points. We construct the EM-PST layout described in
Section 2.1 on the input points, using branching parameter $f = b^{1/16}$ and leaf parameter $\ell = B$. The height
of the constructed tree $T$ is $O(1)$. In each internal node $v$ of $T$, we store the data structure of Lemma 3 on
the $O(fB) = O(b^{1/8})$ points associated with the children of $v$.

We answer a query $[x_1, x_2] \times (-\infty, y]$ using an approach similar to the one employed in Section 2.2. We
first find the two leaves $u_1$ and $u_2$ containing the predecessors of $x_1$ and $x_2$ among points stored in leaves.
This can easily be done using the Fusion Trees stored on the split values in each node of $T$. To report the
points in the query range that are associated with a node on the two paths from the root to $u_1$ and $u_2$, we
simply scan the points associated with each node on the paths. What remains are the subtrees hanging off
the query paths. These are the subtrees with a root node that is not on the query paths, but which has a
parent on the query paths. These subtrees are handled by first traversing the two paths from $\mathrm{LCA}(u_1, u_2)$ to
$u_1$ and $u_2$ (subtrees hanging off at a higher node cannot contain points inside the $x$-range of the query). In
each node $v$ on these paths, we query the data structure of Lemma 3 with a slightly modified query range:
In a node $v$ on the path to $u_1$, we increase $x_1$ to not include $x$-coordinates of points in the child subtree
containing $u_1$. Similarly, on the path to $u_2$, we decrease $x_2$ to not include $x$-coordinates of points in the
child subtree containing $u_2$. This can easily be done by a predecessor search on the split values. We finish
by recursing into each child from which all $B$ points were reported. Here we again query the data structure
of Lemma 3 and recurse on children where all points are reported.

For the subtrees hanging off the query path, we only recurse into a node $v$ if all $B$ points associated to $v$
are reported. There we spend $O(1 + k'/B)$ I/Os querying the data structure of Lemma 3, where $k'$ denotes
the output size of the query among points associated to children of $v$. We charge all this to the output size.
Finally, the height of $T$ is $O(1)$, and we spend $O(1)$ I/Os in each node on the paths from the root to $u_1$ and
$u_2$, thus we conclude that our data structure answers queries in $O(1 + k/B)$ I/Os. For the space, simply
observe that each point is stored in only one data structure of Lemma 3. We therefore get

Lemma 4 There exists a linear space data structure for three-sided range reporting on b°^> points on the 
grid [n] x [n], that answers queries in 0(1 + k/B) I/Os. 

If we combine Lemma 2 and Lemma 4 we finally get our main result:

Theorem 1 There exists a linear space data structure for three-sided range reporting on n points in rank
space, that answers queries in O(1 + k/B) I/Os.

3 Colored range and prefix reporting 

Colored range reporting. Our optimal solution for three-sided range reporting in rank space immediately
gives an optimal solution to the one-dimensional colored range reporting problem in rank space. In this
problem, we are given sets $C_1, \ldots, C_m \subseteq \{1, \ldots, \sigma\}$, and are to preprocess them into a data structure that
supports queries of the form: Given indices (a, b), report the set $\bigcup_{a \le i \le b} C_i$. We think of each $C_i$ as an
ordered set of colors, hence the name colored range reporting.

Our optimal solution to this problem follows from a simple and elegant reduction described in [19].
Below, we present the reduction as it applies to our problem. 
First think of each set $C_i$ as a set of one-dimensional colored points $\{(i, c) \mid c \in C_i\}$, where i denotes
the coordinate of a point and c the color. We transform these sets of points into a one-dimensional point set
S in rank space, simply by using the colors as secondary coordinates. More formally, we replace each point
(i, c) by the point (i', c) where $i' = |(C_i)_{\le c}| + \sum_{j=1}^{i-1} |C_j|$. Here $|(C_i)_{\le c}|$ denotes the number of colors in $C_i$
that are less than or equal to c. Finally, we transform S into a two-dimensional point set $\tilde{S}$ without colors,
by mapping each colored point $(i', c) \in S$ to the two-dimensional point (i', pred(i', c)), where pred(i', c)
denotes the coordinate of the predecessor of (i', c) amongst points in S with color c.

To answer a query (a, b), we simply ask the three-sided query $[a', b'] \times (-\infty, a' - 1]$ on $\tilde{S}$. Here
$a' = 1 + \sum_{j=1}^{a-1} |C_j|$ and $b' = \sum_{j=1}^{b} |C_j|$. The correctness follows from the fact that for each color c
with a point in the range [a, b], precisely one point in S with color c is inside the range [a', b'], and also
has a predecessor of color c before a'. It follows that precisely the same point is inside the query range
$[a', b'] \times (-\infty, a' - 1]$ in $\tilde{S}$, and thus we report one point for each color inside the query range [a, b]. If we
augment points in $\tilde{S}$ with the color of the point mapping to it, this returns the set of colors inside the query
range. Note that the values a' and b' can be obtained by table lookups in O(1) I/Os, so we can conclude:

Theorem 2 There is a linear space data structure for one-dimensional colored range reporting in rank
space that answers queries in O(1 + k/B) I/Os.
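
The reduction behind Theorem 2 can be checked on small inputs with the following sketch. It builds the uncolored two-dimensional point set (each point augmented with its color) and answers the three-sided query by a brute-force scan, which is only a stand-in for the data structure of Theorem 1. The function names are hypothetical, and a point with no predecessor of the same color is assumed to get predecessor coordinate 0.

```python
from itertools import accumulate


def build_reduction(color_sets):
    """color_sets[i - 1] plays the role of C_i (indices are 1-based as in the text)."""
    # prefix[i] = |C_1| + ... + |C_i|, so a' and b' are table lookups.
    prefix = [0] + list(accumulate(len(c) for c in color_sets))
    last_seen = {}   # color -> coordinate i' of the most recent point with that color
    points = []      # triples (i', pred(i', c), c)
    for i, colors in enumerate(color_sets, start=1):
        for rank, c in enumerate(sorted(colors), start=1):
            x = prefix[i - 1] + rank          # i' = |(C_i)_{<=c}| + sum_{j<i} |C_j|
            points.append((x, last_seen.get(c, 0), c))  # 0 = no predecessor (assumption)
            last_seen[c] = x
    return prefix, points


def colored_range(prefix, points, a, b):
    """Colors in C_a, ..., C_b via the query [a', b'] x (-inf, a' - 1]."""
    a2 = 1 + prefix[a - 1]
    b2 = prefix[b]
    # Brute-force scan standing in for the three-sided range reporting structure.
    return [c for (x, y, c) in points if a2 <= x <= b2 and y <= a2 - 1]
```

For example, with `color_sets = [{1, 3}, {2, 3}, set(), {1, 4}, {3}]`, the call `colored_range(*build_reduction(color_sets), 2, 4)` lists each of the colors 1, 2, 3 and 4 exactly once.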

Colored prefix reporting. We now consider the following problem: Given a set S of strings, each O(1)
disk blocks in length, and a function $c : S \to 2^{\{1, \ldots, \sigma\}}$, support queries of the kind: Given a string p, report
the set $\bigcup_{x \in S \cap p^*} c(x)$, where p* denotes the set of strings with prefix p. Building on work of Alstrup et
al., Belazzougui et al. [13] have shown the following:

Theorem 3 Given a collection S of n strings, there is a linear space data structure that, given a string p of
length O(B), returns in O(1) I/Os: 1) The interval of ranks (within S) of the strings in $S \cap p^*$, and 2) The
longest common prefix of the strings in $S \cap p^*$.

In particular, the first item means that we have a reduction of prefix queries to range queries in rank
space that works in O(1) I/Os and linear space. Combined with Theorem 2, this implies an optimal solution
for colored prefix reporting, for prefixes of length O(B).
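
The first item of Theorem 3 relies on the fact that the strings with a given prefix are contiguous in sorted order. A minimal sketch of this reduction follows; it uses plain binary search rather than the O(1)-I/O structure of Theorem 3, and the function name `prefix_rank_interval` is hypothetical.

```python
from bisect import bisect_left


def prefix_rank_interval(sorted_strings, p):
    """Return 1-based inclusive ranks (lo, hi) of the strings with prefix p.

    The interval is empty when lo > hi. Binary search is used here purely
    for illustration; it is not the structure of Theorem 3.
    """
    lo = bisect_left(sorted_strings, p)   # first string >= p
    left, right = lo, len(sorted_strings)
    # From position lo on, strings with prefix p come first, then the rest.
    while left < right:
        mid = (left + right) // 2
        if sorted_strings[mid].startswith(p):
            left = mid + 1
        else:
            right = mid
    return lo + 1, left
```

The returned pair (a, b) can then be passed as a colored range query over the color sets c(x) listed in rank order.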

4 Top-k colored prefix reporting

Suppose that we are interested in reporting just the k first colors in $\{1, \ldots, \sigma\}$ that match a prefix query p (of
length O(B)), where k is now a parameter of the data structure. We use the notation $\mathrm{top}_k(S)$ to denote the
largest k elements of a set S (where $\mathrm{top}_k(S) = S$ if |S| < k). The techniques in the previous sections do
not seem to lead to optimal results in the I/O model. However, if we consider a more powerful, yet arguably
realistic, model with parallel data access, it turns out that a simple data structure provides optimal bounds.

Scatter-I/O model. We consider a special case of the parallel disk model [28] where there are B disks,
and each block contains a single word. (Notice that we use B differently than one would for the parallel I/O
model.) A single I/O operation thus consists of retrieving or writing B words that may be placed arbitrarily
in storage. To distinguish this from a normal I/O operation, we propose the notation sI/O (for scatter I/O).
This model abstracts (and idealizes) the memory model used by IBM's Cell architecture [15], which has
been shown to alleviate memory bottlenecks for problems such as BFS [25] that are notoriously hard in the
I/O model [22].
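
To make the difference between the two cost measures concrete, the following sketch counts the operations needed to read a given set of word addresses. It assumes words are numbered consecutively and that a standard block holds the B words with addresses jB, ..., jB + B - 1; both function names are hypothetical.

```python
def standard_ios(addresses, B):
    """Standard I/O model: one I/O per distinct block of B consecutive words."""
    return len({a // B for a in addresses})


def scatter_ios(addresses, B):
    """Scatter-I/O model: each sI/O fetches any B words, wherever they are placed."""
    distinct = len(set(addresses))
    return -(-distinct // B)   # ceiling division
```

With B = 8, reading the eight words at addresses 0, 100, 200, ..., 700 costs eight standard I/Os but a single sI/O.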



4.1 Our data structure 



We construct a collection $S_k$ consisting of prefixes of strings in S. For each $p \in S_k$ we store the color set
$c_k(p) = \mathrm{top}_k(\bigcup_{x \in S \cap p^*} c(x))$ in sorted order. Recall that we use p* to denote the set of all strings with prefix
p. Given a prefix p, there is a minimal subset $S_p \subseteq S_k \cap p^*$ that covers p in the sense that any string in
$S \cap p^*$ has a string in $S_p$ as a prefix. This means that the result of a query for p is $\mathrm{top}_k(\bigcup_{x \in S_p} c_k(x))$.

Choice of $S_k$. We give an inductive definition of $S_k$, where prefixes of strings in S are considered in
decreasing order of length. A prefix p is included in $S_k$ if either $p \in S$ or the following condition holds:
$\sum_{x \in S_p} |c_k(x)| > 2 |\bigcup_{x \in S_p} c_k(x)|$. Since $S_p$ depends only on prefixes longer than p, this is well-defined. Intuitively, we build
a data structure for $c_k(p)$ if this will reduce the cost of reading all elements in $c_k(p)$ by a factor of more
than 2. This happens if more than half of the elements in the multiset $\bigcup_{x \in S_p} c_k(x)$ are duplicates.
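
The inductive definition can be traced on small inputs with the following sketch. It works on plain Python strings and dictionaries rather than on disk-resident lists, treats "largest k elements" as the k numerically largest colors stored in descending order, and finds covers by scanning all stored prefixes; all of these are simplifying assumptions, and the names `top_k`, `cover_of`, `build_lists` and `query` are hypothetical.

```python
def top_k(colors, k):
    """The k largest colors, in sorted (descending) order."""
    return sorted(colors, reverse=True)[:k]


def cover_of(p, lists):
    """Minimal set of stored prefixes covering p (the set S_p)."""
    if p in lists:
        return [p]
    cands = [q for q in lists if q.startswith(p)]
    # Keep only candidates that have no other candidate as a proper prefix.
    return [q for q in cands if not any(r != q and q.startswith(r) for r in cands)]


def build_lists(color_of, k):
    """color_of maps each string x in S to its color set c(x).

    Returns a dict mapping each prefix p in S_k to the list c_k(p).
    """
    S = set(color_of)
    prefixes = {s[:n] for s in S for n in range(len(s) + 1)}
    lists = {}
    for p in sorted(prefixes, key=len, reverse=True):   # longest prefixes first
        if p in S:
            matching = set().union(*(color_of[x] for x in S if x.startswith(p)))
            lists[p] = top_k(matching, k)
            continue
        cover = cover_of(p, lists)                      # only longer prefixes so far
        total = sum(len(lists[x]) for x in cover)       # multiset size
        distinct = set().union(*(lists[x] for x in cover))
        if total > 2 * len(distinct):                   # more than half are duplicates
            lists[p] = top_k(distinct, k)
    return lists


def query(p, lists, k):
    """top_k of the union of the lists stored at the cover of p."""
    return top_k(set().union(*(lists[x] for x in cover_of(p, lists))), k)
```

For instance, with S = {aa, ab, ac, b}, c(aa) = c(ab) = c(ac) = {1, 2}, c(b) = {3} and k = 2, the prefix a is added because its cover holds six elements of which only two are distinct, while the empty prefix is not added.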

Space usage. An accounting argument shows that the total space usage for these lists is $O(\sum_{x \in S} |c_k(x)|)$,
i.e., linear in the size of the data: For each $x \in S$ place $|c_k(x)|$ credits on x; this is $\sum_{x \in S} |c_k(x)|$ credits
in total. If we build a data structure for a prefix p, the lists merged to form it have a total of at least $2|c_k(p)|$
credits. This is enough to pay for the space used, $|c_k(p)|$ credits, and for placing $|c_k(p)|$ credits on p. By
induction, we can pay for all lists constructed using the original credits, which implies linear space usage.

Support structures. To support efficient search for the nodes that cover a given query we consider the
compacted trie which consists of all branching nodes in the trie of S. Any colored prefix query $p_{\mathrm{orig}}$ can
be converted into a query for p, where p is the path leading to the highest branching node having $p_{\mathrm{orig}}$ as
a prefix. By Theorem 3 we can find the string p in O(1) I/Os, using that p has length O(B). For each
branching node p we store the sequence of pointers to the nodes $S_p \subseteq S_k$ that cover p. There can be at most
O(B) pointers to a given node, one for each ancestor in the trie, i.e., O(mB) pointers in total. Also, we
store the number of elements that should be reported from each node in $S_p$. More specifically, we store the
list of non-zero numbers, with corresponding pointers. The total space for this is O(mB).

Queries. On a query we first locate an equivalent branching prefix p, as outlined above. Then we retrieve
the list of nodes in $S_p$ from which results should be retrieved, and the number of elements from each. This
requires O(1 + k/B) I/Os. Note that the total number of elements, including duplicates, is at most 2k,
because if it were larger a merged list would have been created. Finally, since we now know O(k) positions
in memory containing all k elements to be reported, it is trivial to retrieve them in O(k/B) sI/Os. This is
the only point where we use the extra power of the model; all previous steps involve standard I/Os.

5 Open problems 

An obvious question is whether our results for the I/O model can be extended to the cache-oblivious model [18].
Also, it would be interesting to investigate whether our results can be obtained with the indivisibility
assumption, or if the problem separates the I/O model with and without the indivisibility assumption. Finally,
it would be interesting to see if top-k colored prefix (and range) reporting admits an efficient solution in the
I/O model, or if the top-k version separates the I/O and scatter I/O models.

Acknowledgement. We thank ChengXiang Zhai for making us aware of query relaxation in information
retrieval.